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Abstract: A challenging problem in estimating high-dimensional graphical models is 
to choose the regularization parameter in a data-dependent way. The standard tech- 
niques include K-fo\d cross-validation (K-CV), Akaike information criterion (AIC), 
and Bayesian information criterion (BIC). Though these methods work well for low- 
dimensional problems, they are not suitable in high dimensional settings. In this paper, 
we present StARS: a new stability-based method for choosing the regularization pa- 
rameter in high dimensional inference for undirected graphs. The method has a clear 
interpretation: we use the least amount of regularization that simultaneously makes a 
graph sparse and rcplicable under random sampling. This interpretation requires essen- 
tially no conditions. Under mild conditions, we show that StARS is partially sparsistent 
in terms of graph estimation: i.e. with high probability, all the true edges will be in- 
cluded in the selected model even when the graph size diverges with the sample size. 
Empirically, the performance of StARS is compared with the state-of-the-art model 
selection procedures, including K-CV, AIC, and BIC, on both synthetic data and a 
real microarray dataset. StARS outperforms all these competing procedures. 

Keywords and phrases: regularization selection, stability, cross validation, Akaike 
information criterion, Bayesian information criterion, partial sparsistency. 
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1. Introduction 

Undirected graphical models have emerged as a useful tool because they allow for a stochas- 
tic description of complex associations in high- dimensional data. For example, biological 
processes in a cell lead to complex interactions among gene products. It is of interest to de- 
termine which features of the system are conditionally independent. Such problems require 
us to infer an undirected graph from i.i.d. observations. Each node in this graph corresponds 
to a random variable and the existence of an edge between a pair of nodes represent their 
conditional independence relationship. 

Gaussian graphical models (Dempster, 1972; Edwards, 1995; Lauritzen, 1996; Whittaker, 
1990) are by far the most popular approach for learning high dimensional undirected graph 
structures. Under the Gaussian assumption, the graph can be estimated using the sparsity 
pattern of the inverse covariance matrix. If two variables are conditionally independent, the 
corresponding element of the inverse covariance matrix is zero. In many applications, es- 
timating the the inverse covariance matrix is statistically challenging because the number 
of features measured may be much larger than the number of collected samples. To han- 
dle this challenge, the graphical lasso or glasso (Banerjee, Ghaoui and d'Aspremont, 2008; 
Friedman, Hastie and Tibshirani, 2007; Yuan and Lin, 2007) is rapidly becoming a popu- 
lar method for estimating sparse undirected graphs. To use this method, however, the 
user must specify a regularization parameter A that controls the sparsity of the graph. 
The choice of A is critical since different A's may lead to different scientific conclusions 
of the statistical inference. Other methods for estimating high dimensional graphs include 
(Liu, Lafferty and Wasserman, 2009; Meinshausen and Biihlmann, 2006; Peng et ai, 2009). 
They also require the user to specify a regularization parameter. 

The standard methods for choosing the regularization parameter are AIC (Akaike, 1973), 
BIG (Schwarz, 1978) and cross validation (Efron, 1982). Though these methods have good 
theoretical properties in low dimensions, they are not suitable for high dimensional problems. 
In regression, cross-validation has been shown to overfit the data (Wasserman and Roeder, 
2009). Likewise, AIG and BIG tend to perform poorly when the dimension is large relative 
to the sample size. Our simulations confirm that these methods perform poorly when used 
with glasso. 

A new approach to model selection, based on model stability, has recently generated some 
interest in the literature (Lange et ai, 2004). The idea, as we develop it, is based on subsam- 
pling (Politis, Romano and Wolf, 1999) and builds on the approach of Meinshausen and Biihlmann 
(2010). We draw many random subsamples and construct a graph from each subsample (un- 
like K-fold cross-validation, these subsamples are overlapping). We choose the regularization 
parameter so that the obtained graph is sparse and there is not too much variability across 
subsamples. More precisely, we start with a large regularization which corresponds to an 
empty, and hence highly stable, graph. We gradually reduce the amount of regularization 
until there is a small but acceptable amount of variability of the graph across subsamples. In 
other words, we regularize to the point that we control the dissonance between graphs. The 
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procedure is named StARS: Stability Approach to Regularization Selection. We study the 
performance of StARS by simulations and theoretical analysis in Sections 4 and 5. Although 
we focus here on graphical models, StARS is quite general and can be adapted to other 
settings including regression, classification, clustering, and dimensionality reduction. 

In the context of clustering, results of stability methods have been mixed. Weaknesses of 
stability have been shown in (Ben-david, Luxburg and Pal, 2006). However, the approach 
was successful for density-based clustering (Rinaldo and Wasserman, 2009). For graph se- 
lection, Meinshausen and Biihlmann (2010) also used a stability criterion; however, their 
approach differs from StARS in its fundamental conception. They use subsampling to pro- 
duce a new and more stable regularization path then select a regularization parameter from 
this newly created path, whereas we propose to use subsampling to directly select one reg- 
ularization parameter from the original path. Our aim is to ensure that the selected graph 
is sparse, but inclusive, while they aim to control the familywise type I errors. As a conse- 
quence, their goal is contrary to ours: instead of selecting a larger graph that contains the 
true graph, they try to select a smaller graph that is contained in the true graph. As we will 
discuss in Section 3, in specific application domains like gene regulatory network analysis, 
our goal for graph selection is more natural. 

In Section 2 we review the basic notion of estimating high dimensional undirected graphs; 
in Section 3 we develop StARS; in Section 4 we present a theoretical analysis of the proposed 
method; and in Section 5 we report experimental results on both simulated data and a gene 
microarray dataset, where the problem is to construct gene regulatory network based on 
natural variation of the expression levels of human genes. 

2. Estimating a High-dimensional Undirected Graph 

Let X = (X(l), . . . , X(p)) be a random vector with distribution P. The undirected graph 
G = (y,E) associated with P has vertices V = {X{1), . . . ,X{p)} and a set of edges E 
corresponding to pairs of vertices. In this paper, we also interchangeably use E to denote 
the adjacency matrix of the graph G. The edge corresponding to X{j) and X{k) is absent 
if X{j) and X{k) are conditionally independent given the other coordinates of X. The 
graph estimation problem is to infer E from i.i.d. observed data Xi, . . . ,Xn where Xi = 
(X,(l),...,X,(p))^. 

Suppose now that P is Gaussian with mean vector fi and covariance matrix S. Then the 
edge corresponding to X{j) and X{k) is absent if and only if Qjk = where Q = 
Hence, to estimate the graph we only need to estimate the sparsity pattern of Q. When 
p could diverge with n, estimating Q is difficult. A popular approach is the graphical lasso 
or glasso (Banerjee, Ghaoui and d'Aspremont, 2008; Friedman, Hastie and Tibshirani, 2007; 
Yuan and Lin, 2007). Using glasso, we estimate Q as follows: Ignoring constants, the log- 
likelihood (after maximizing over /i) can be written as 

£{n) = \og\n\ -trace(Sl]) 
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where S is the sample covariance matrix. With a positive regular izat ion parameter A, the 
glasso estimator Q{X) is obtained by minimizing the regularized negative log-likelihood 

n{X) = aTgmm\-i{n) + X\\n\\i} (2.1) 

where = Y2jk\^jk\ is the elementwise £i-norm of H. The estimated graph G{X) = 

{V,E{X)) is then easily obtained from il{X): for i ^ j, an edge G E{X) if and only if 
the corresponding entry in f2(A) is nonzero. Friedman, Hastie and Tibshirani (2007) give a 
fast algorithm for calculating n(A) over a grid of As ranging from small to large. By taking 
advantage of the fact that the objective function in (2.1) is convex, their algorithm itera- 
tively estimates a single row (and column) of in each iteration by solving a lasso regression 
(Tibshirani, 1996). The resulting regularization path f2(A) for all As has been shown to have 
excellent theoretical properties (Ravikumar et ai, 2009; Rothman et ai, 2008). For exam- 
ple, Ravikumar et al. (2009) show that, if the regularization parameter A satisfies a certain 
rate, the corresponding estimator f2(A) could recover the true graph with high probability. 
However, these types of results are either asymptotic or non-asymptotic but with very large 
constants. They are not practical enough to guide the choice of the regularization parameter 
A in finite-sample settings. 

3. Regularization Selection 

In Equation (2.1), the choice of A is critical because A controls the sparsity level of G{X). 
Larger values of A tend to yield sparser graphs and smaller values of A yield denser graphs. It 
is convenient to define A = 1/ A so that small A corresponds to a more sparse graph. In par- 
ticular, A = corresponds to the empty graph with no edges. Given a grid of regularization 
parameters Qn = {Ai, . . . , A^}, our goal of graph regularization parameter selection is to 
choose one A G such that the true graph E is contained in E{A) with high probability. In 
other words, we want to "overselect" instead of "underselect" . Such a choice is motivated by 
application problems like gene regulatory networks reconstruction, in which we aim to study 
the interactions of many genes. For these types of studies, we tolerant some false positives 
but not false negatives. Specifically, it is acceptable that an edge presents but the two genes 
corresponding to this edge do not really interact with each other. Such false positives can 
generally be screened out by more fine-tuned downstream biological experiments. However, 
if one important interaction edge is omitted at the beginning, it's very difficult for us to 
re-discovery it by follow-up analysis. There is also a tradeoff: we want to select a denser 
graph which contains the true graph with high probability. At the same time, we want the 
graph to be as sparse as possible so that important information will not be buried by mas- 
sive false positives. Based on this rationale, an "underselect" method, like the approach of 
Meinshausen and Biihlmann (2010), does not really fit our goal. In the following, we start 
with an overview of several state-of-the-art regularization parameter selection methods for 
graphs. We then introduce our new StARS approach. 
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3.1. Existing Methods 

The regularization parameter is often chosen using AIC or BIG. Let VL[K) denote the esti- 
mator corresponding to A. Let d{A) denote the degree of freedom (or the effective number 
of free parameters) of the corresponding Gaussian model AIG chooses A as 

(AIG) A = argmin{-2£(fi(A)) + 2rf(A)}, (3.1) 

and BIG chooses A as 

(BIG) A = argmin{-2£(fi(A)) + d(A) ■ logn}. (3.2) 

The usual theoretical justification for these methods assumes that the dimension p is fixed 
as n increases; however, in the case where p > n this justification is not applicable. In fact, 
it's even not straightforward how to estimate the degree of freedom d{A) when p is larger 
than n . A common practice is to calculate d{A) as d{A) = m(A)(m(A) — l)/2 + p where 
m(A) denotes the number of nonzero elements of n(A). As we will see in our experiments, 
AIG and BIG tend to select overly dense graphs in high dimensions. 

Another popular method is i^-fold cross-validation (K-CV). For this procedure the data 
is partitioned into K subsets. Of the K subsets one is retained as the validation data, and 
the remaining K — 1 ones are used as training data. For each A G we estimate a graph 
on the K — 1 training sets and evaluate the negative log-likelihood on the retained validation 
set. The results are averaged over all K folds to obtain a single GV score. We then choose 
A to minimize the GV score over he whole grid Qn- In regression, cross-validation has been 
shown to overfit (Wasserman and Roeder, 2009). Our experiments will confirm this is true 
for graph estimation as well. 

3.2. StARS: Stability Approach to Regularization Selection 

The StARS approach is to choose A based on stability. When A is 0, the graph is empty 
and two datasets from P would both yield the same graph. As we increase A, the variability 
of the graph increases and hence the stability decreases. We increase A just until the point 
where the graph becomes variable as measured by the stability. StARS leads to a concrete 
rule for choosing A. 

Let b = h{n) be such that 1 < h{n) < n. We draw random subsamples Si,...,Sn 
from Xi, . . . each of size h. There are (^) such subsamples. Theoretically one uses all 
(^) subsamples. However, Po litis, Romano and Wolf (1999) show that it suffices in practice 
to choose a large number of subsamples at random. Note that, unlike bootstrapping 
(Efron, 1982), each subsample is drawn without replacement. For each A G Qn-, we construct 
a graph using the glasso for each subsample. This results in A^ estimated edge matrices 
E\{A)^ . . . , £'^(A). Focus for now on one edge (s, t) and one value of A. Let V'^(') denote the 
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glasso algorithm with the regularization parameter A. For any subsample Sj let ip^^{Sj) = 1 
if the algorithm puts an edge and i^^ti^j) = if the algorithm does not put an edge between 
(s, t). Define 

e':,{A) = n^iJ^,{x^,...,x,) = i). 

To estimate ^^^(A), we use a U-statistic of order b, namely, 

N 



Now define the parameter ^^{A) = 2^,^,(A)(1 -^,^,(A)) and let ^^(A) = 2^,(A)(1 - ^,(A)) 
be its estimate. Then ^^^(A), in addition to being twice the variance of the Bernoulli indicator 
of the edge (s, t), has the following nice interpretation: For each pair of graphs, we can ask how 
often they disagree on the presence of the edge: ■C^j (A) is the fraction of times they disagree. 
For A G we regard Cst(^) ^ measure of instability of the edge across subsamples, with 
< e(A) < 1/2. 

Define the total instability by averaging over all edges: 

A(A) 



Clearly on the boundary -Dfe(O) = 0, and Dh{A) generally will increase as A increases. 
However, when A gets very large, all the graphs will become dense and -Df,(A) will be- 
gin to decrease. Subsample stability for large A is essentially an artifact. We are inter- 
ested in stability for sparse graphs not dense graphs. For this reason we monotonize Df,{A) 
by defining Db{A) = supo<t<A -Dfe(t). Finally, our StARS approach chooses A by defining 
As = sup I A : Db{A) ^ /^j for a specified cut point value (3. 

It may seem that we have merely replaced the problem of choosing A with the problem 
of choosing /3, but /3 is an interpretable quantity and we always set a default value (3 = 0.05. 
One thing to note is that all quantities E,6,C,,D depend on the subsampling block size b. 
Since StARS is based on subsampling, the effective sample size for estimating the selected 
graph is b instead of n. Compared with methods like BIC and AIC which fully utilize all n 
data points. StARS has some efficiency loss in low dimensions. However, in high dimensional 
settings, the gain of StARS on better graph selection significantly dominate this efficiency 
loss. This fact is confirmed by our experiments. 

4. Theoretical Properties 

The StARS procedure is quite general and can be applied with any graph estimation al- 
gorithms. Here, we provide its theoretical properties. We start with a key theorem which 
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establishes the rates of convergence of the estimated stabihty quantities to their popula- 
tion means. We then discuss the implication of this theorem on general gaph regularization 
selection problems. 

Let A be an element in the grid Qn = {Ai, . . . , Ax} where i^' is a polynomial of n. We 
denote Db{A) = E(Df,(A)). The quantity ^((A) is an estimate of Cst(^) ^^cl -Db(A) is an 
estimate of Dh{A). Standard ^/-statistic theory guarantees that these estimates have good 
uniform convergence properties to their population quantities: 

Theorem 4.1. (Uniform Concentration) The following statements hold with no assumptions 
on P. For any 6 G (0, 1), with probability at least 1 — S, we have 



VA e a„. max itW - ?.',(A)I < Jl!M^i2ii^±i2lM)) . (4.1) 

s<t V n 



ma.|fl.(A) - fl.(A)| < /l^Mlog A' + 41ogp + log (1/^)) ^ ^^^^^ 

Proof. Note that ^^^(A) is a [/-statistic of order b. Hence, by Hoeffding's inequality for U- 
statistics (Serfling, 1980), we have, for any e > 0, 

P(|^,(A) - ^,^(A)| > 6) < 2exp {-2neyb) . (4.3) 

Now C,st{A) is just a function of the [/-statistic 6'^^ (A). Note that 

-^.(A))-^.^(A)(l-e,^(A))| (4.4) 

^*(A))'-e(A) + (^.^(A))'| (4.5) 

^]*(A)-^.^(A)|+2|(^,(A))^- (^,^(A))^| (4.6) 

^*(A) - e(A)| + 2|(^,(A) - e(A))(^,(A) + e(A))| (4.7) 

^*(A)-^.^(A)|+4|^,(A)-^,^(A)| (4.8) 

^.^(A)|, (4.9) 



lat(A) - 


C^(A)| = 2|^,(A)(1- 




= 2|^,(A)- 




< 2|^,(A)- 




< 2|^,(A)- 




< 2|^,(A)- 




= 6|^,(A)- 


we have |g,(A) 


-e(A)| <6|^,(A)-^ 



edges, we obtain: for each A E Qn, 

P(max|g,(A) -C^(A)| > 6e) < 2/exp {-2ne'/b) . (4.10) 

s<t 

Using two union bound arguments over the K values of A and all the p{p — l)/2 edges, we 
have: 

¥ La^\D,{A) - D,iA)\ > e) < |^„,| ■ ■ P(max [^(A) - e.^(A)| > e) (4.11) 

J 2 s<t 

< /s:-p^-exp(-neV(186)). (4.12) 

Equations (4.1) and (4.2) follow directly from (4.10) and the above exponential probability 
inequality. □ 
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Theorem 4.1 allows us to explicitly characterize the high-dimensional scaling of the sample 
size n, dimensionality p, subsampling block size b, and the grid size K. More specifically, we 
get 



by setting 6 = 1/n in Equation (4.2). From (4.13), let ci,C2 be arbitrary positive constants, 
if 6 = Cia/ti, K = n'^^, and p < exp (n"^) for some 7 < 1/2, the estimated total stability 
Db{A) still converges to its mean Db{A) uniformly over the whole grid Qn- 

We now discuss the implication of Theorem 4.1 to graph regularization selection problems. 
Due to the generality of StARS, we provide theoretical justifications for a whole family 
of graph estimation procedures satisfying certain conditions. Let ip he a. graph estimation 
procedure. We denote E'^{A) as the estimated edge set using the regularization parameter A 
by applying ip on a. subsampled dataset with block size b. To establish graph selection result, 
we start with two technical assumptions: 

(Al) 3Ao G Qn, such that maxA<A„AAGe„ -Db(A) < f3/2 for large enough n. 
(A2) For any A G Gn and A > Ao, C E^A)) ^ 1 as n ^ 00. 

Note that Aq here depends on the sample size n and does not have to be unique. To un- 
derstand the above conditions, (Al) assumes that there exists a threshold Aq G Qn, such 
that the population quantity -Db(A) is small for all A < Aq. (A2) requires that all estimated 
graphs using regularization parameters A > Aq contain the true graph with high probability. 
Both assumptions are mild and should be satisfied by most graph estimation algorithm with 
reasonable behaviors. There is a tradeoff on the design of the subsampling block size b . To 
make (A2) hold, we require b to be large. However, to make Db{A) concentrate to Db{A) 
fast, we require b to be small. Our suggested value is 6 = [lOy^J, which balances both the 
theoretical and empirical performance well. The next theorem provides the graph selection 
performance of StARS: 

Theorem 4.2. (Partial Sparsistency): Let to be a graph estimation algorithm. We assume 
(Al) and (A2) hold for ip using b = [10^/n\ and \Qn\ = K = n'^^ for some constant ci > 
0. Let As G Qn be the selected regularization parameter using the StARS procedure with a 
constant cutting point /3. Then, if p < exp {n'^) for some ''j < 1/2, we have 



Proof. We define An to be the event that maxAeg„ \Db{A) — Db{A)\ < 13/2. The scaling of 
n,K,b,p in the theorem satisfies the L.H.S. of (4.13), which implies that F{An) — 1 as 
n — )■ 00. 

Using (Al), we know that, on An, 




(4.13) 



C E\As)) ^1 as n-^ 00. 



(4.14) 




, Db{A) < (3. 

'n 



(4.15) 
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This implies that, on An, As > Aq. The result follows by applying (A2) and a union bound. 

□ 

5. Experimental Results 

We now provide empirical evidence to illustrate the usefulness of StARS and compare it 
with several state-of-the-art competitors, including 10-fold cross-validation (K-CV), BIG, 
and AIC For StARS we always use subsampling block size b{n) = [10 ■ ^/n\ and set the 
cut point /3 = 0.05. We first quantitatively evaluate these methods on two types of synthetic 
datasets, where the true graphs are known. We then illustrate StARS on a microarray dataset 
that records the gene expression levels from immortalized B cells of human subjects. On all 
high dimensional synthetic datasets, StARS significantly outperforms its competitors. On the 
microarray dataset, StARS obtains a remarkably simple graph while all competing methods 
select what appear to be overly dense graphs. 

5.1. Synthetic Data 

To quantitatively evaluate the graph estimation performance, we adapt the criteria including 
precision, recall, and Fi-score from the information retrieval literature. Let G = (V, E) be a 
p-dimensional graph and let G = {V, E) be an estimated graph. We define 

\E n E\ \E f] E\ precision ■ recall 

precision = — — , recall = — — — , ^i-score = 2 ■ -. (5.1) 

1^1 \E\ precision + recall 

In other words. Precision is the number of correctly estimated edges divided by the total 
number of edges in the estimated graph; recall is the number of correctly estimated edges 
divided by the total number of edges in the true graph; the Fi-score can be viewed as a 
weighted average of the precision and recall, where an Fi-score reaches its best value at 1 
and worst score at 0. On the synthetic data where we know the true graphs, we also compare 
the previous methods with an oracle procedure which selects the optimal regularization pa- 
rameter by minimizing the total number of different edges between the estimated and true 
graphs along the full regularization path. Since this oracle procedure requires the knowledge 
of the truth graph, it is not a practical method. We only present it here to calibrate the 
inherent challenge of each simulated scenario. To make the comparison fair, once the regu- 
larization parameters are selected, we estimate the oracle and StARS graphs only based on 
a subsampled dataset with size 

b{n) = [lOv^J. 

In contrast, the K-CV, BIG, and AIG graphs are estimated using the full dataset. More 
details about this issue were discussed in Section 3. 

We generate data from sparse Gaussian graphs, neighborhood graphs and hub graphs, 
which mimic characteristics of real-wolrd biological networks. The mean is set to be zero and 
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Q ^. For both graphs, the diagonal elements of Q are set to be 



from a unit square. We then 

2^ 



the covariance matrix S 
one. More specifically: 

1. Neighborhood graph: We first uniformly sample yi, . . . ,y. 
set = ^Iji = p with probability (a/^vt) exp (— 4||yj — yj\\^)- All the rest Vtij are 
set to be zero. The number of nonzero off-diagonal elements of each row or column is 
restricted to be smaller than [1/pJ. In this paper, p is set to be 0.245. 

2. Huh graph: The rows/columns are partitioned into J equally-sized disjoint groups: 
Vl U V2 . . . U Vj = {1, . . . each group is associated with a "pivotal" row k. Let 
\Vi\ = s. We set Qik = ^ki = p for z G and flik = ^ki = otherwise. In our 
experiment, J = [p/s\ , /c = 1, s + 1, 2s + 1, . . ., and we always set p = + 1) with 
s = 20. 

We generate synthetic datasets in both low-dimensional (n = 800, p = 40) and high- 
dimensional (n = 400, p = 100) settings. Table 1 provides comparisons of all methods, where 
we repeat the experiments 100 times and report the averaged precision, recall, Fi-score with 
their standard errors. 

Table 1 

Quantitative comparison of different methods on the datasets from the neighborhood and hub 

graphs. 



Neighborhood graph: n =800, p=40 
Methods Precision Recall 



Neighborhood graph: n=400, p =100 



Fi -score 



Precision 



Recal 



Fi-score 



Oracle 

StARS 

K-CV 

BIC 

AlC 



0.9222 (0.05) 
0.7204 (0.08) 
0.1394 (0.02) 
0.9738 (0.03) 
0.8696 (0.11) 



0.9070 (0.07) 
0.9530 (0.05) 
1.0000 (0.00) 
0.9948 (0.02) 
0.9996 (0.01) 



0.9119 (0.04) 
0.8171 (0.05) 
0.2440 (0.04) 
0.9839 (0.01) 
0.9236 (0.07) 



0.7473 (0.09) 
0.6366 (0.07) 
0.1383 (0.01) 
0.1796 (0.11) 
0.1279 (0.00) 



0.8001 (0.06) 
0.8718 (0.06) 
1.0000 (0.00) 
1.0000 (0.00) 
1.0000 (0.00) 



0.7672 (0.07) 
0.7352 (0.07) 
0.2428 (0.01) 
0.2933 (0.13) 
0.2268 (0.01) 



Hub graph: n =800, p=40 



Hub graph: n=400, p =100 



Methods 



Precision 



Recal 



Fi -score 



Precision 



Recal 



Fi-score 



Oracle 

StARS 

K-CV 

BIC 

AlC 



0.9793 (0.01) 
0.4377 (0.02) 
0.2383 (0.09) 
0.4879 (0.05) 
0.2522 (0.09) 



1.0000 (0.00) 
1.0000 (0.00) 
1.0000 (0.00) 
1.0000 (0.00) 
1.0000 (0.00) 



0.9895 (0.01) 
0.6086 (0.02) 
0.3769 (0.01) 
0.6542 (0.05) 
0.3951 (0.00) 



0.8976 (0.02) 
0.4572 (0.01) 
0.1574 (0.01) 
0.2155 (0.00) 
0.1676 (0.00) 



1.0000 (0.00) 
1.0000 (0.00) 
1.0000 (0.00) 
1.0000 (0.00) 
1.0000 (0.00) 



0.9459 (0.01) 
0.6274 (0.01) 
0.2719 (0.00) 
0.3545 (0.01) 
0.2871 (0.00) 



For low-dimensional settings where n ^ p, the BIC criterion is very competitive and 
performs the best among all the methods. In high dimensional settings, however, StARS 
clearly outperforms all the competing methods for both neighborhood and hub graphs. This 
is consistent with our theory. At first sight, it might be surprising that for data from low- 
dimensional neighborhood graphs, BIC and AIC even outperform the oracle procedure! This 
is due to the fact that both BIC and AIC graphs are estimated using all the n = 800 data 
points, while the oracle graph is estimated using only the subsampled dataset with size 
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b{n) = [10 ■ ^/n\ = 282. Direct usage of the full sample is an advantage of model selection 
methods that take the general form of BIG and AIC In high dimensions, however, we see 
that even with this advantage, StARS clearly outperforms BIG and AIG. The estimated 
graphs for different methods in the setting n = 400, p = 100 are provided in Figures 1 and 2, 
from which we see that the StARS graph is almost as good as the oracle, while the K-CV, 
BIG, and AIG graphs are overly too dense. 





(a) True graph 



(b) Oracle graph 



(c) StARS graph 




(d) K-CV graph 



(e) BIG graph 



(f) AIC graph 



Fig 1. Comparison of different methods on the data from the neighborhood graphs (n = 400, p 

loo;. 



5.2. Microarray Data 

We apply StARS to a dataset based on Affymetrix GeneGhip microarrays for the gene ex- 
pression levels from immortalized B cells of human subjects. The sample size is n = 294. The 
expression levels for each array are pre-processed by log-transformation and standardization 
as in (Nayak et ai, 2009). Using a previously estimated sub-pathway subset containing 324 
genes (Liu et ai, 2010), we study the estimated graphs obtained from each method under 
investigation. The StARS and BIG graphs are provided in Figure 3. We see that the StARS 
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(a) True graph (b) Oracle graph (c) StARS graph 




(d) K-C\ graph (e) BIG graph (f) AIC graph 

Fig 2. Comparison of different methods on the data from the huh graphs (n = 400, p = 100/ 



graph is remarkably simple and informative, exhibiting some cliques and hub genes. In con- 
trast, the BIG graph is very dense and possible useful association information is buried in the 
large number of estimated edges. The selected graphs using AIC and K-CV are even more 
dense than the BIC graph and is omitted here. A full treatment of the biological implication 
of these two graphs validated by enrichment analysis will be left as a future study. 

6. Conclusions 

The problem of estimating structure in high dimensions is very challenging. Casting the 
problem in the context of a regularized optimization has led to some success, but the choice 
of the regularization parameter is critical. We present a new method, StARS, for choosing 
this parameter in high dimensional inference for undirected graphs. Like Meinshausen and 
Biihlmann's stability selection approach (Meinshausen and Biihlmann, 2010), our method 
makes use of subsampling, but it differs substantially from their approach in both imple- 
mentation and goals. For graphical models, we choose the regularization parameter directly 
based on the edge stability. Under mild conditions, StARS is partially sparsistent. However, 
even without these conditions, StARS has a simple interpretation: we use the least amount 
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(a) StARS graph (b) BIC graph 



Fig 3. Microarray data example. The StARS graph is more informative graph than the BIC graph. 

of regularization that simultaneously makes a graph sparse and replicable under random 
sampling. 

Empirically, we show that StARS works significantly better than existing techniques on 
both synthetic and microarray datasets. Although we focus here on graphical models, our new 
method is generally applicable to many problems that involve estimating structure, including 
regression, classification, density estimation, clustering, and dimensionality reduction. 
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